Goto

Collaborating Authors

 New Brunswick


Large Language Models (LLMs) for Source Code Analysis: applications, models and datasets

arXiv.org Artificial Intelligence

Large language models (LLMs) and transformer-based architectures are increasingly utilized for source code analysis. As software systems grow in complexity, integrating LLMs into code analysis workflows becomes essential for enhancing efficiency, accuracy, and automation. This paper explores the role of LLMs for different code analysis tasks, focusing on three key aspects: 1) what they can analyze and their applications, 2) what models are used and 3) what datasets are used, and the challenges they face. Regarding the goal of this research, we investigate scholarly articles that explore the use of LLMs for source code analysis to uncover research developments, current trends, and the intellectual structure of this emerging field. Additionally, we summarize limitations and highlight essential tools, datasets, and key challenges, which could be valuable for future work.


UNB StepUP: A footStep database for gait analysis and recognition using Underfoot Pressure

arXiv.org Artificial Intelligence

Gait refers to the patterns of limb movement generated during walking, which are unique to each individual due to both physical and behavioural traits. Walking patterns have been widely studied in biometrics, biomechanics, sports, and rehabilitation. While traditional methods rely on video and motion capture, advances in underfoot pressure sensing technology now offer deeper insights into gait. However, underfoot pressures during walking remain underexplored due to the lack of large, publicly accessible datasets. To address this, the UNB StepUP database was created, featuring gait pressure data collected with high-resolution pressure sensing tiles (4 sensors/cm$^2$, 1.2m by 3.6m). Its first release, UNB StepUP-P150, includes over 200,000 footsteps from 150 individuals across various walking speeds (preferred, slow-to-stop, fast, and slow) and footwear types (barefoot, standard shoes, and two personal shoes). As the largest and most comprehensive dataset of its kind, it supports biometric gait recognition while presenting new research opportunities in biomechanics and deep learning. The UNB StepUP-P150 dataset sets a new benchmark for pressure-based gait analysis and recognition. Please note that the hypertext links to the dataset on FigShare remain dormant while the document is under review.


FedNIA: Noise-Induced Activation Analysis for Mitigating Data Poisoning in FL

arXiv.org Artificial Intelligence

Federated learning systems are increasingly threatened by data poisoning attacks, where malicious clients compromise global models by contributing tampered updates. Existing defenses often rely on impractical assumptions, such as access to a central test dataset, or fail to generalize across diverse attack types, particularly those involving multiple malicious clients working collaboratively. To address this, we propose Federated Noise-Induced Activation Analysis (FedNIA), a novel defense framework to identify and exclude adversarial clients without relying on any central test dataset. FedNIA injects random noise inputs to analyze the layerwise activation patterns in client models leveraging an autoencoder that detects abnormal behaviors indicative of data poisoning. FedNIA can defend against diverse attack types, including sample poisoning, label flipping, and backdoors, even in scenarios with multiple attacking nodes. Experimental results on non-iid federated datasets demonstrate its effectiveness and robustness, underscoring its potential as a foundational approach for enhancing the security of federated learning systems.


Federated Continual Learning: Concepts, Challenges, and Solutions

arXiv.org Artificial Intelligence

Federated Continual Learning (FCL) has emerged as a robust solution for collaborative model training in dynamic environments, where data samples are continuously generated and distributed across multiple devices. This survey provides a comprehensive review of FCL, focusing on key challenges such as heterogeneity, model stability, communication overhead, and privacy preservation. We explore various forms of heterogeneity and their impact on model performance. Solutions to non-IID data, resource-constrained platforms, and personalized learning are reviewed in an effort to show the complexities of handling heterogeneous data distributions. Next, we review techniques for ensuring model stability and avoiding catastrophic forgetting, which are critical in non-stationary environments. Privacy-preserving techniques are another aspect of FCL that have been reviewed in this work. This survey has integrated insights from federated learning and continual learning to present strategies for improving the efficacy and scalability of FCL systems, making it applicable to a wide range of real-world scenarios.


A Study on the Importance of Features in Detecting Advanced Persistent Threats Using Machine Learning

arXiv.org Artificial Intelligence

Advanced Persistent Threats (APTs) pose a significant security risk to organizations and industries. These attacks often lead to severe data breaches and compromise the system for a long time. Mitigating these sophisticated attacks is highly challenging due to the stealthy and persistent nature of APTs. Machine learning models are often employed to tackle this challenge by bringing automation and scalability to APT detection. Nevertheless, these intelligent methods are data-driven, and thus, highly affected by the quality and relevance of input data. This paper aims to analyze measurements considered when recording network traffic and conclude which features contribute more to detecting APT samples. To do this, we study the features associated with various APT cases and determine their importance using a machine learning framework. To ensure the generalization of our findings, several feature selection techniques are employed and paired with different classifiers to evaluate their effectiveness. Our findings provide insights into how APT detection can be enhanced in real-world scenarios.


Encoded Spatial Attribute in Multi-Tier Federated Learning

arXiv.org Artificial Intelligence

This research presents an Encoded Spatial Multi-Tier Federated Learning approach for a comprehensive evaluation of aggregated models for geospatial data. In the client tier, encoding spatial information is introduced to better predict the target outcome. The research aims to assess the performance of these models across diverse datasets and spatial attributes, highlighting variations in predictive accuracy. Using evaluation metrics such as accuracy, our research reveals insights into the complexities of spatial granularity and the challenges of capturing underlying patterns in the data. We extended the scope of federated learning (FL) by having multi-tier along with the functionality of encoding spatial attributes. Our N-tier FL approach used encoded spatial data to aggregate in different tiers. We obtained multiple models that predicted the different granularities of spatial data. Our findings underscore the need for further research to improve predictive accuracy and model generalization, with potential avenues including incorporating additional features, refining model architectures, and exploring alternative modeling approaches. Our experiments have several tiers representing different levels of spatial aspects. We obtained accuracy of 75.62% and 89.52% for the global model without having to train the model using the data constituted with the designated tier. The research also highlights the importance of the proposed approach in real-time applications.


Explainable Malware Detection through Integrated Graph Reduction and Learning Techniques

arXiv.org Artificial Intelligence

Control Flow Graphs and Function Call Graphs have become pivotal in providing a detailed understanding of program execution and effectively characterizing the behavior of malware. These graph-based representations, when combined with Graph Neural Networks (GNN), have shown promise in developing high-performance malware detectors. However, challenges remain due to the large size of these graphs and the inherent opacity in the decision-making process of GNNs. This paper addresses these issues by developing several graph reduction techniques to reduce graph size and applying the state-of-the-art GNNExplainer to enhance the interpretability of GNN outputs. The analysis demonstrates that integrating our proposed graph reduction technique along with GNNExplainer in the malware detection framework significantly reduces graph size while preserving high performance, providing an effective balance between efficiency and transparency in malware detection.


MACeIP: A Multimodal Ambient Context-enriched Intelligence Platform in Smart Cities

arXiv.org Artificial Intelligence

This paper presents a Multimodal Ambient Context-enriched Intelligence Platform (MACeIP) for Smart Cities, a comprehensive system designed to enhance urban management and citizen engagement. Our platform integrates advanced technologies, including Internet of Things (IoT) sensors, edge and cloud computing, and Multimodal AI, to create a responsive and intelligent urban ecosystem. Key components include Interactive Hubs for citizen interaction, an extensive IoT sensor network, intelligent public asset management, a pedestrian monitoring system, a City Planning Portal, and a Cloud Computing System. We demonstrate the prototype of MACeIP in several cities, focusing on Fredericton, New Brunswick. This work contributes to innovative city development by offering a scalable, efficient, and user-centric approach to urban intelligence and management.


XEdgeAI: A Human-centered Industrial Inspection Framework with Data-centric Explainable Edge AI Approach

arXiv.org Artificial Intelligence

Recent advancements in deep learning have significantly improved visual quality inspection and predictive maintenance within industrial settings. However, deploying these technologies on low-resource edge devices poses substantial challenges due to their high computational demands and the inherent complexity of Explainable AI (XAI) methods. This paper addresses these challenges by introducing a novel XAI-integrated Visual Quality Inspection framework that optimizes the deployment of semantic segmentation models on low-resource edge devices. Our framework incorporates XAI and the Large Vision Language Model to deliver human-centered interpretability through visual and textual explanations to end-users. This is crucial for end-user trust and model interpretability. We outline a comprehensive methodology consisting of six fundamental modules: base model fine-tuning, XAI-based explanation generation, evaluation of XAI approaches, XAI-guided data augmentation, development of an edge-compatible model, and the generation of understandable visual and textual explanations. Through XAI-guided data augmentation, the enhanced model incorporating domain expert knowledge with visual and textual explanations is successfully deployed on mobile devices to support end-users in real-world scenarios. Experimental results showcase the effectiveness of the proposed framework, with the mobile model achieving competitive accuracy while significantly reducing model size. This approach paves the way for the broader adoption of reliable and interpretable AI tools in critical industrial applications, where decisions must be both rapid and justifiable.


NativQA: Multilingual Culturally-Aligned Natural Query for LLMs

arXiv.org Artificial Intelligence

Natural Question Answering (QA) datasets play a crucial role in developing and evaluating the capabilities of large language models (LLMs), ensuring their effective usage in real-world applications. Despite the numerous QA datasets that have been developed, there is a notable lack of region-specific datasets generated by native users in their own languages. This gap hinders the effective benchmarking of LLMs for regional and cultural specificities. In this study, we propose a scalable framework, NativQA, to seamlessly construct culturally and regionally aligned QA datasets in native languages, for LLM evaluation and tuning. Moreover, to demonstrate the efficacy of the proposed framework, we designed a multilingual natural QA dataset, MultiNativQA, consisting of ~72K QA pairs in seven languages, ranging from high to extremely low resource, based on queries from native speakers covering 18 topics. We benchmark the MultiNativQA dataset with open- and closed-source LLMs. We made both the framework NativQA and MultiNativQA dataset publicly available for the community. (https://nativqa.gitlab.io)